Approximate Joins for Relational Data
Abstract
Krommydas, Ioannis, Evagelos, Georgia. MSc, Computer Science Department, University of Ioannina, Greece. June 2008. Approximate Joins for Relational Data. Thesis Supervisor: Vassiliadis Panos.
Relational databases often contain duplicate data entries. Duplicates arise for a variety of reasons, such as typographical errors, multiple conventions for recording database fields, or other sources of noise. Duplicate detection is a crucial procedure, especially for large databases. In this thesis, we present a method that extends the state-of-the-art method for duplicate detection. Given a database holding valid data, we classify each input tuple as either a new tuple or an existing one. The proposed method uses an effective algorithm to determine a set of candidate reference tuples. For each candidate reference tuple, we apply appropriate similarity metrics to decide whether the input tuple matches it. The whole procedure is accelerated by trie data structures that cache frequent input tuples. Finally, we present a number of experiments evaluating the effectiveness of our method, together with a comparative study against the state-of-the-art method.
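The pipeline outlined in the abstract (candidate selection, similarity scoring, trie-based caching of repeated inputs) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract does not name the candidate-selection technique or the similarity metrics, so the sketch assumes a q-gram inverted index for candidates and Jaccard similarity over q-grams for scoring. All names (qgrams, jaccard, TrieCache, ApproximateMatcher, the 0.6 threshold) are hypothetical and chosen only for illustration, and tuples are simplified to single strings.

```python
from collections import defaultdict

_END = object()  # sentinel key marking the end of a cached string in the trie


def qgrams(text, q=3):
    """Return the set of q-grams of a lowercased, padded string."""
    padded = "#" * (q - 1) + text.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class TrieCache:
    """Character trie that maps previously seen inputs to their results."""

    def __init__(self):
        self.root = {}

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None  # no cached string starts with this prefix
        return node.get(_END)  # None if key is only a prefix of cached strings

    def put(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[_END] = value


class ApproximateMatcher:
    """Classify inputs as matching a reference value or as new (assumed design)."""

    def __init__(self, reference, threshold=0.6):
        self.reference = list(reference)
        self.threshold = threshold  # assumed cut-off, not taken from the thesis
        self.grams = [qgrams(r) for r in self.reference]
        # Inverted index: q-gram -> positions of reference values containing it.
        self.index = defaultdict(set)
        for pos, grams in enumerate(self.grams):
            for gram in grams:
                self.index[gram].add(pos)
        self.cache = TrieCache()

    def classify(self, value):
        """Return (matched reference value or None, best similarity score)."""
        cached = self.cache.get(value)
        if cached is not None:
            return cached
        grams = qgrams(value)
        # Candidates: reference values sharing at least one q-gram with the input.
        candidates = set()
        for gram in grams:
            candidates |= self.index.get(gram, set())
        best, best_score = None, 0.0
        for pos in candidates:
            score = jaccard(grams, self.grams[pos])
            if score > best_score:
                best, best_score = self.reference[pos], score
        if best_score < self.threshold:
            best = None  # below the threshold: treat the input as a new value
        result = (best, best_score)
        self.cache.put(value, result)
        return result


matcher = ApproximateMatcher(["University of Ioannina", "Computer Science Department"])
match, score = matcher.classify("Univercity of Ioannina")  # misspelled input
```

In this sketch a repeated input is answered from the trie without recomputing candidates or scores, which mirrors the caching idea in the abstract only at a high level. A real system would compare multi-attribute tuples and would likely combine several metrics.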
Similar resources
Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms
Integrating heterogeneous data from sources as diverse as web pages, digital libraries, knowledge bases, the Semantic Web and databases is an open problem. The ultimate aim of our work is to be able to query such heterogeneous data sources as if their data were conveniently held in a single relational database. Pursuant to this aim, we propose a generalisation of joins from the relational datab...
Approximate String Joins
String data is ubiquitous and is commonly used to correlate (or join) entities across autonomous, heterogeneous databases. The main challenge is to effectively deal with the noisy nature of string data, due to, for example, transcription errors, incomplete information, and multiple conventions for recording string valued attributes. Commercial databases do not support approximate string joins d...
Scoped and Approximate Queries in a Relational Grid Information Service
We are developing a grid information service, RGIS, that is based on the relational data model. RGIS supports complex queries written in SQL that search for compositions (using joins) of resources. For example, we might ask it to find a Linux cluster with a certain bisection bandwidth and total memory. Such queries can be expensive to execute, however, and so we have developed several approache...
Approximate String Joins in a Database (Almost) for Free
String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not suppo...
Using q-grams in a DBMS for Approximate String Processing
String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a ...